Logistic Regression
Machine Learning
Classification
The linear regression model discussed in the previous lesson assumes that the response variable Y is quantitative. But in many situations the response variable is instead qualitative. For example, eye color is qualitative, taking on values blue, brown, or green.
Often qualitative variables are referred to as categorical; we will use these terms interchangeably. In this lesson, we study approaches for predicting qualitative responses, a process that is known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. On the other hand, often the methods used for classification first predict the probability of each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods.
Examples of Classification: Classification problems occur often, perhaps even more so than regression problems.
A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.
Consider the following example:
Brand Preference for Orange Juice: Citrus Hill Vs Minute Maid
- We would like to predict which orange juice customers prefer to buy: Citrus Hill or Minute Maid.
- The Y (Purchase) variable is categorical: 0 or 1.
- The X (LoyalCH) variable is a numerical value (between 0 and 1) that measures how loyal a customer is to Citrus Hill (CH) orange juice.
- Can we use Linear Regression when Y is categorical?
Why not Linear Regression?
When Y only takes on the values 0 and 1, why is standard linear regression inappropriate?
- The regression line \(\beta_0 +\beta_1 X\) can take on any value between negative and positive infinity.
- In the orange juice classification problem, Y can only take on two possible values: 0 or 1.
Solution:
Instead of trying to predict Y, let’s try to predict the probability that a customer buys the Citrus Hill (CH) juice, i.e., let’s try to predict \(P(Y=1)\).
If we predict this probability instead, the function \(P(Y=1)\) outputs values between 0 and 1, which we can interpret.
For example, it is more valuable to have an estimate of the probability that an insurance claim is fraudulent than a hard classification as fraudulent or not.
Example: Self-reported heights in inches for males and females.
Let’s provide a prediction for a student that is 66 inches tall.
What is the conditional probability of being female if you are 66 inches tall?
In our dataset, we can estimate this by rounding to the nearest inch and computing the proportion of females among individuals whose reported height rounds to 66 inches.
To construct a prediction algorithm, we want to estimate the proportion of the population that is female for any given height \(𝑋=𝑥\), which we write as the conditional probability described above
\[P(𝑌=1|𝑋=𝑥)\]
Since the results from the plot above look close to linear, and it is the only approach we currently know, we will try regression.
We assume that: \[p(x)=P(Y=1|X=x)=\beta_0 +\beta_1 x \]
Then our estimate of the conditional probability \(p(x)\) is \(\hat p(x)=\hat\beta_0 +\hat\beta_1 x\).
Decision rule: predict female if \(\hat p(x)>0.5\).
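The estimation step above can be sketched in R. This assumes the self-reported heights are the `heights` data from the dslabs package; if the data live elsewhere, substitute accordingly.

```r
# Sketch, assuming the heights data come from dslabs::heights
library(dslabs)
library(tidyverse)
data("heights")

# Estimate Pr(Y = 1 | X = 66) by rounding heights to the nearest inch
heights %>%
  filter(round(height) == 66) %>%
  summarize(p_female = mean(sex == "Female"))

# Linear-regression estimate of p(x), then the 0.5 decision rule
lm_fit <- heights %>%
  mutate(y = as.numeric(sex == "Female")) %>%
  lm(y ~ height, data = .)
p_hat <- predict(lm_fit, newdata = data.frame(height = 66))
ifelse(p_hat > 0.5, "Female", "Male")
```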
Another problem: here we are estimating a probability, \(\Pr(Y=1|X=x)\), which is constrained to lie between 0 and 1, while the linear model is not.
The idea of generalized linear models (GLM) is to 1) define a distribution of \(Y\) that is consistent with its possible outcomes and 2) find a function \(g\) so that \(g(\operatorname{Pr}(Y=1 \mid X=x))\) can be modeled as a linear combination of predictors. Logistic regression is the most commonly used GLM. It is an extension of linear regression that ensures that the estimate of \(\operatorname{Pr}(Y=1 \mid X=x)\) is between 0 and 1. This approach makes use of the logistic transformation defined as:
\[g(p)=\log \frac{p}{1-p} \] This logistic transformation converts probability to log odds.
The odds tell us how much more likely something is to happen than not to happen. \(p=0.5\) means the odds are 1 to 1, thus the odds are 1. If \(p=0.75\), the odds are 3 to 1.
A nice characteristic of this transformation is that it maps probabilities to log odds that are symmetric around 0 (with \(p=0.5\) mapping to 0).
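In R, the logit and logistic functions are built in as qlogis() and plogis(), so this symmetry is easy to check:

```r
p <- c(0.25, 0.5, 0.75)
p / (1 - p)        # odds: 1/3, 1, 3
qlogis(p)          # log odds: -1.0986, 0, 1.0986 -- symmetric around 0
plogis(qlogis(p))  # back to the original probabilities
```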
Logistic Regression
With logistic regression, we model the conditional probability directly with:
\[g\{\operatorname{Pr}(Y=1 \mid X=x)\}=\beta_0+\beta_1 x\] With this model, we can no longer use least squares. Instead we compute the maximum likelihood estimate (MLE).
\[ \ell\left(\beta_0, \beta\right)=\prod_{i: y_i=1} p\left(x_i\right) \prod_{i: y_i=0}\left(1-p\left(x_i\right)\right) \] This likelihood gives the probability of the observed zeros and ones in the data. We pick \(\beta_0\) and \(\beta_1\) to maximize the likelihood of the observed data.
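In practice it is numerically easier to maximize the log of this likelihood, which turns the products into sums: \[\log \ell\left(\beta_0, \beta_1\right)=\sum_{i: y_i=1} \log p\left(x_i\right)+\sum_{i: y_i=0} \log \left(1-p\left(x_i\right)\right)\] Since the logarithm is monotone, the maximizing \(\beta_0\) and \(\beta_1\) are the same. There is no closed-form solution, so the maximum is found numerically; R's glm() uses Fisher scoring, which is what the "Number of Fisher Scoring iterations" line in the model summaries refers to.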
\[p(x)=P(Y=1|X=x)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}\] or you can re-write the above formula as follows:
\[\log \left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1 X\]
This monotone transformation is called the ‘log odds’ or ‘logit’ transformation of \(p(x)\).
Here, for the above example, both linear and logistic regression provide an estimate of the conditional expectation \[E(Y \mid X=x)\] which, in the case of binary data, is equivalent to the conditional probability \[\operatorname{Pr}(Y=1 \mid X=x)\]
Example: Credit Card Default Data
A simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt.
We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance.
With this example we will try to understand what happens to the target variable based on the predictor variables.
library(ISLR)
library(tidyverse)
library(reshape2)
data("Default")
str(Default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
attach(Default)
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
Let’s explore the relationship between default status and the other variables, visualizing with ggplot; to do so we first need to rearrange (melt) the dataset.
df.m <- melt(Default[,c(1,3,4)], id.var = "default")
ggplot(data = df.m, aes(x=default, y=value)) +
geom_boxplot(aes(fill=default)) + facet_wrap(~variable)
Let’s free the scales:
ggplot(data = df.m, aes(x=default, y=value)) +
geom_boxplot(aes(fill=default)) + facet_wrap(~variable,scales="free")
ggplot(data=Default,aes(x=student))+geom_bar(aes(fill=default))
The data is unbalanced: the probability of defaulting is low.
Balance and student status seem to have a
significant effect on default status.
For this data set, if we fit a linear regression model, the model can produce nonsensical results for some values of balance, because probabilities are limited to values in the closed interval [0, 1].
Default %>%
mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
ggplot(aes(balance, prob)) +
geom_point(alpha = .15) +
geom_smooth(method = "lm") +
ggtitle("Linear regression model fit") +
xlab("Balance") +
ylab("Probability of Default")
Now let’s fit a logistic regression model for this data set using only the balance variable.
Since we have two classes for the target variable, we set family = "binomial".
glm.balance=glm(default~balance,data=Default,family="binomial")
If we try to visualize the model:
Default %>%
mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
ggplot(aes(balance, prob)) +
geom_point(alpha = .15) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
ggtitle("Logistic regression model fit") +
xlab("Balance") +
ylab("Probability of Default")
summary(glm.balance)
##
## Call:
## glm(formula = default ~ balance, family = "binomial", data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
## balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
The summary can be interpreted similarly to linear regression. The Estimate column gives the estimated values of \(\beta_0\) and \(\beta_1\); the standard error measures the variation of our estimator across different samples. The last column is the p-value: a low value means the corresponding predictor has a significant effect.
Deviance is analogous to the sum of squares calculations in linear regression and is a measure of the lack of fit to the data in a logistic regression model. The null deviance represents the difference between a model with only the intercept (which means “no predictors”) and a saturated model (a model with a theoretically perfect fit).
The goal is for the model deviance (noted as Residual deviance) to be lower; smaller values indicate better fit. In this respect, the null model provides a baseline upon which to compare predictor models.
Keep in mind that the coefficient estimates from logistic regression characterize the relationship between the predictor and response variable on a log-odds scale.
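Exponentiating a coefficient converts it from the log-odds scale to an odds ratio, since for a one-unit increase in the predictor \[\frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)}=\frac{e^{\beta_0+\beta_1(x+1)}}{e^{\beta_0+\beta_1 x}}=e^{\beta_1}\]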
exp(coef(glm.balance))
## (Intercept) balance
## 2.366933e-05 1.005514e+00
Thus, the estimate \(\hat\beta_1 = 0.0055\) indicates that a one-dollar increase in balance is associated with an increase in the log odds of default of 0.0055 units.
Equivalently, exponentiating the coefficient gives \(e^{\hat\beta_1} = 1.0055\): for every one-dollar increase in the monthly balance carried, the odds of the customer defaulting increase by a multiplicative factor of 1.0055.
95% confidence intervals for the coefficients:
confint(glm.balance)
## 2.5 % 97.5 %
## (Intercept) -11.383288936 -9.966565064
## balance 0.005078926 0.005943365
Let’s see how the balance affects the default status.
Making Predictions:
predict(glm.balance, data.frame(balance = c(1000, 2000)), type = "response")
## 1 2
## 0.005752145 0.585769370
Here we compare the probability of defaulting for balances of $1000 and $2000. As the balance moves from $1000 to $2000, the probability of defaulting increases dramatically, from about 0.6% to 59%!
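The same predictions can be computed by hand from the fitted coefficients, using R's built-in logistic function plogis() (a sanity check on the model fitted above, not a new model):

```r
b <- coef(glm.balance)  # (Intercept) and balance coefficients from above
unname(plogis(b[1] + b[2] * c(1000, 2000)))  # matches predict(..., type = "response")
```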
One can also use qualitative predictors with the logistic regression model. As an example, we fit a model that uses the student variable as the predictor.
glm.student=glm(default~student,data=Default,family="binomial")
summary(glm.student)
##
## Call:
## glm(formula = default ~ student, family = "binomial", data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.2970 -0.2970 -0.2434 -0.2434 2.6585
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.50413 0.07071 -49.55 < 2e-16 ***
## studentYes 0.40489 0.11502 3.52 0.000431 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 2908.7 on 9998 degrees of freedom
## AIC: 2912.7
##
## Number of Fisher Scoring iterations: 6
Let’s see how student status affects the default status.
predict(glm.student, data.frame(student = factor(c("Yes", "No"))),
type = "response")
## 1 2
## 0.04313859 0.02919501
In fact, this model suggests that a student has about 1.5 times the odds of defaulting compared with a non-student (\(e^{0.405} \approx 1.50\)).
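The odds ratio for students can be read off directly by exponentiating the student coefficient of the model fitted above:

```r
exp(coef(glm.student)["studentYes"])  # about 1.50
```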
Now let us fit a model that predicts the probability of default based on the balance, income, and student status variables.
glm.fits=glm(default~.,data=Default,family="binomial")
summary(glm.fits)
##
## Call:
## glm(formula = default ~ ., family = "binomial", data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4691 -0.1418 -0.0557 -0.0203 3.7383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
There is a surprising result here. The p-values associated with
balance and student=Yes status are very small,
indicating that each of these variables is associated with the
probability of defaulting. However, the coefficient for the student
variable is negative, indicating that students are less likely to
default than non-students.
In contrast, the coefficient for the student variable in
the previous model, where we predicted the probability of default based
only on student status, indicated that students have a greater
probability of defaulting!
ggplot(Default,aes(y=balance,x=student,fill=student))+
geom_boxplot()+
xlab("Student status")+
ylab("Balance")
This is because the variables student and balance are correlated. Students tend to hold higher levels of debt, which is in turn associated with a higher probability of default.
In other words, students are more likely to have large credit card balances, which, as we know from the first model, tend to be associated with high default rates. Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students.
Default %>%
mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
ggplot(aes(x=balance,y=prob,grp=student,col=student)) +
geom_point(alpha = .15) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
ggtitle("Logistic regression model fit for different student status") +
xlab("Balance") +
ylab("Probability of Default")
The predict function calculates predicted probabilities for given values of the predictors.
glm.probs=predict(glm.fits,type="response")
glm.probs[1:10]
## 1 2 3 4 5 6
## 1.428724e-03 1.122204e-03 9.812272e-03 4.415893e-04 1.935506e-03 1.989518e-03
## 7 8 9 10
## 2.333767e-03 1.086718e-03 1.638333e-02 2.080617e-05
A reasonable prediction rule may be to assign Yes if the predicted probability of yes is > 0.5.
glm.pred=rep("No",nrow(Default))
glm.pred[glm.probs>.5]="Yes"
Now we can compare the predicted target variable with the observed values.
We do this using the confusion matrix, a table that describes classification performance.
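The raw counts behind the confusion matrix can also be obtained with base R, without the caret package:

```r
table(Predicted = glm.pred, Observed = Default$default)  # raw counts
mean(glm.pred == Default$default)                        # training accuracy
```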
library(caret)
# needs to be factors with same levels
confusionMatrix(as.factor(glm.pred),default)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 9627 228
## Yes 40 105
##
## Accuracy : 0.9732
## 95% CI : (0.9698, 0.9763)
## No Information Rate : 0.9667
## P-Value [Acc > NIR] : 0.0001044
##
## Kappa : 0.4278
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9959
## Specificity : 0.3153
## Pos Pred Value : 0.9769
## Neg Pred Value : 0.7241
## Prevalence : 0.9667
## Detection Rate : 0.9627
## Detection Prevalence : 0.9855
## Balanced Accuracy : 0.6556
##
## 'Positive' Class : No
##
But the training misclassification rate does not give an honest estimate of prediction accuracy.
To compare different models we need to compare test errors, holding out part of the data for testing (a common practice is 80% for training and 20% for testing; below we use roughly a 60/40 split).
Now let’s divide our data set into training and testing sets and calculate the test error rates.
train=sample(c(TRUE,FALSE),size=nrow(Default),prob=c(0.6,0.4),replace=TRUE)
Default.test=Default[!train,]
dim(Default.test)
## [1] 4033 4
default.test=default[!train]
Here we fit all three models that we used earlier, to see which is the better model.
model1=glm(default~balance,data=Default,family=binomial,subset=train)
model2=glm(default~student,data=Default,family=binomial,subset=train)
model3=glm(default~.,data=Default,family=binomial,subset=train)
Calculating the predictions for the testing set:
test.predicted.m1 <- predict(model1, Default.test, type = "response")
test.predicted.m2 <- predict(model2, Default.test, type = "response")
test.predicted.m3 <- predict(model3, Default.test, type = "response")
Re-coding Yes and No:
glm.pred1=rep("No",nrow(Default.test))
glm.pred2=rep("No",nrow(Default.test))
glm.pred3=rep("No",nrow(Default.test))
glm.pred1[test.predicted.m1>.5]="Yes"
glm.pred2[test.predicted.m2>.5]="Yes"
glm.pred3[test.predicted.m3>.5]="Yes"
list(
table1 = round(table(default.test, glm.pred1)/nrow(Default.test),3),
table2 = round(table(default.test, glm.pred2)/nrow(Default.test),3),
table3 = round(table(default.test, glm.pred3)/nrow(Default.test),3)
)
## $table1
## glm.pred1
## default.test No Yes
## No 0.963 0.004
## Yes 0.022 0.011
##
## $table2
## glm.pred2
## default.test No
## No 0.967
## Yes 0.033
##
## $table3
## glm.pred3
## default.test No Yes
## No 0.963 0.004
## Yes 0.022 0.011
The results show that model1 and model3 are very similar. 96% of the predicted observations are true negatives and about 1% are true positives.
Both models have a type II error of less than 2.5% in which the model predicts the customer will not default but they actually did and both models have a type I error of less than 0.5% in which the models predicts the customer will default but they never did.
model2 results are notably different; this model accurately predicts the non-defaulters (a result of 97% of the data being non-defaulters) but never actually predicts those customers that default!
m1.error = mean(default.test != glm.pred1)
m2.error = mean(default.test != glm.pred2)
m3.error = mean(default.test != glm.pred3)
c(m1.error,m2.error,m3.error)
## [1] 0.02529135 0.03297793 0.02628316
Looking at the misclassification rates, there is not much improvement between models 1 and 3 and although model 2 has a low error rate, it never accurately predicts customers that actually default.
Use the following link to read about Multinomial Logistic Regression: link
Advantages and Disadvantages of Logistic Regression
Advantages
Logistic regression is easy to implement and interpret, and very efficient to train.
It makes no assumptions about the distributions of classes in feature space.
It extends easily to multiple classes (multinomial regression) and provides a natural probabilistic view of class predictions.
It provides not only a measure of how relevant a predictor is (coefficient size), but also its direction of association (positive or negative).
It is very fast at classifying unknown records.
Good accuracy for many simple data sets and it performs well when the dataset is linearly separable.
It can interpret model coefficients as indicators of feature importance.
Logistic regression is less inclined to over-fitting, but it can overfit in high-dimensional datasets. One may consider regularization (L1 and L2) techniques to avoid over-fitting in these scenarios.
Disadvantages
If the number of observations is smaller than the number of features, logistic regression should not be used; otherwise it may lead to overfitting.
It constructs linear boundaries.
The major limitation of Logistic Regression is the assumption of linearity between the dependent variable and the independent variables.
It can only be used to predict discrete outcomes; hence the dependent variable of logistic regression must be discrete.
Non-linear problems can’t be solved with logistic regression because it has a linear decision surface. Linearly separable data is rarely found in real-world scenarios.
Logistic regression requires little or no multicollinearity between the independent variables.
It is tough to obtain complex relationships using logistic regression. More powerful and compact algorithms such as Neural Networks can easily outperform this algorithm.
In linear regression the independent and dependent variables are related linearly. Logistic regression instead requires the independent variables to be linearly related to the log odds \(\log(p/(1-p))\).
Example: MNIST data set: Is it 2 or 7?
The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
For this example we only include a randomly selected set of 2s and 7s along with the two predictors based on the proportion of dark pixels in the upper left and lower right quadrants respectively.
The dataset is divided into training and test sets.
library(dslabs)
library(tidyverse)
library(caret)
data("mnist_27")
Let’s visualize the data set:
mnist_27$train %>% ggplot(aes(x_1, x_2, color = y)) + geom_point()
Now let’s fit a model on the training set and use the test set to calculate accuracy rates.
fit_glm <- glm(y ~ x_1 + x_2, data=mnist_27$train, family = "binomial")
p_hat_glm <- predict(fit_glm, mnist_27$test, type="response")
Re-coding:
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2))
confusionMatrix(y_hat_glm, mnist_27$test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2 7
## 2 82 26
## 7 24 68
##
## Accuracy : 0.75
## 95% CI : (0.684, 0.8084)
## No Information Rate : 0.53
## P-Value [Acc > NIR] : 1.266e-10
##
## Kappa : 0.4976
##
## Mcnemar's Test P-Value : 0.8875
##
## Sensitivity : 0.7736
## Specificity : 0.7234
## Pos Pred Value : 0.7593
## Neg Pred Value : 0.7391
## Prevalence : 0.5300
## Detection Rate : 0.4100
## Detection Prevalence : 0.5400
## Balanced Accuracy : 0.7485
##
## 'Positive' Class : 2
##
Decision boundary:
p_hat <- predict(fit_glm, newdata = mnist_27$true_p, type = "response")
mnist_27$true_p %>% mutate(p_hat = p_hat) %>%
ggplot(aes(x_1, x_2, z=p_hat, fill=p_hat)) +
geom_raster() +
scale_fill_gradientn(colors=c("#F8766D","white","#00BFC4")) +
stat_contour(breaks=c(0.5), color="black")
K-Nearest Neighbors
K Nearest Neighbors is a simple algorithm that stores all the available cases and classifies new data based on a similarity measure. It is mostly used to classify a data point based on how its neighbors are classified.
Simple, but a very powerful classification algorithm.
Classifies based on a similarity measure.
Non-parametric
Lazy learning:
Does not “learn” until a test example is given.
Whenever we have new data to classify, we find its K nearest neighbors in the training data.
The new point is classified by “majority vote”: it is assigned to the most common class among its K nearest neighbors, as measured by a distance between data points.
KNN Algorithm Pseudocode:
Step 1: Determine parameter K = number of nearest neighbors
Step 2: Calculate the distance between the query-instance and all the training examples.
Step 3: Sort the distances and determine nearest neighbors based on the k-th minimum distance.
Step 4: Gather the category Y of the nearest neighbors.
Step 5: Use simple majority of the category of nearest neighbors as the prediction values of the query instance.
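The five steps above can be sketched as a from-scratch function (the name knn_predict and the choice of Euclidean distance are ours, for illustration; in practice the class library's knn() is used, as below):

```r
knn_predict <- function(train_X, train_y, query, k) {
  d <- sqrt(rowSums((t(t(train_X) - query))^2))  # Step 2: distances to query
  nn <- order(d)[1:k]                            # Step 3: k nearest neighbors
  votes <- table(train_y[nn])                    # Step 4: gather their classes
  names(votes)[which.max(votes)]                 # Step 5: majority vote
}

# Usage: classify an iris flower from its four measurements
knn_predict(as.matrix(iris[, 1:4]), iris$Species,
            query = c(5.9, 3.0, 5.1, 1.8), k = 5)
```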
Example: KNN in two dimensions
This is the true decision boundary between the two classes.
As you can see from the above plots it is crucial to select an appropriate K value for the model. Lower K values might lead to under-fitting and higher K values might lead to over-fitting.
Few ideas on picking a value for ‘K’
There is no structured method for finding the best value of “K”. We need to try various values by trial and error, treating future test data as unknown.
Smaller values of K can be noisy: individual points have a higher influence on the result.
Larger values of K give smoother decision boundaries, which means lower variance but increased bias; they are also computationally more expensive.
Another way to choose K is through cross-validation. Take a small portion of the training dataset, call it a validation dataset, and use it to evaluate different possible values of K.
We predict the label for every instance in the validation set using K = 1, K = 2, K = 3, and so on, then look at which value of K gives the best performance on the validation set. We take that value as the final setting of our algorithm, thereby minimizing the validation error.
In general practice, a common rule of thumb is \(k = \sqrt N\), where N stands for the number of samples in your training dataset.
Try to keep the value of k odd in order to avoid ties between two classes.
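As a quick sketch of this rule of thumb:

```r
N <- 105                      # e.g., a 70% training split of the 150-row iris data
k <- round(sqrt(N))           # about 10
if (k %% 2 == 0) k <- k + 1   # keep k odd to avoid ties
k
## [1] 11
```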
When to use:
Less than 20 attributes per instance.
Lots of training data.
Pros:
Learning and implementation is extremely simple and intuitive.
Flexible decision boundaries.
Naturally handles multi-class cases
Can do well in practice with enough representative data.
Cons:
Need to determine the value of the parameter K
Irrelevant or correlated features have high impact and must be eliminated.
Typically difficult to handle high dimensionality.
Computational costs: memory and classification time computation.
Storage of data
Example: Default data set (continued)
We will use the class library for this.
library(caret)
library(ISLR)
library(class)
First divide the sample into training and test sets.
set.seed(27)
train=sample(c(TRUE,FALSE),size=nrow(Default),prob=c(0.6,0.4),replace=TRUE)
train.X=cbind(balance,income,student)[train,] # training only for x variables
test.X=cbind(balance,income,student)[!train,] # testing only for x variables
train.default=default[train] #training set Y variable
default.test=default[!train] # testing set Y variable
k-nearest neighbor function with \(k=1\):
knn.pred=knn(train = train.X,test = test.X,cl = train.default,k = 1)
confusionMatrix(knn.pred, default.test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3739 94
## Yes 81 37
##
## Accuracy : 0.9557
## 95% CI : (0.9488, 0.9619)
## No Information Rate : 0.9668
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.2744
##
## Mcnemar's Test P-Value : 0.3643
##
## Sensitivity : 0.9788
## Specificity : 0.2824
## Pos Pred Value : 0.9755
## Neg Pred Value : 0.3136
## Prevalence : 0.9668
## Detection Rate : 0.9463
## Detection Prevalence : 0.9701
## Balanced Accuracy : 0.6306
##
## 'Positive' Class : No
##
k-nearest neighbor function with \(k=4\)
knn.pred=knn(train.X,test.X,train.default,k=4)
confusionMatrix(knn.pred, default.test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3789 105
## Yes 31 26
##
## Accuracy : 0.9656
## 95% CI : (0.9594, 0.971)
## No Information Rate : 0.9668
## P-Value [Acc > NIR] : 0.6912
##
## Kappa : 0.2618
##
## Mcnemar's Test P-Value : 3.857e-10
##
## Sensitivity : 0.9919
## Specificity : 0.1985
## Pos Pred Value : 0.9730
## Neg Pred Value : 0.4561
## Prevalence : 0.9668
## Detection Rate : 0.9590
## Detection Prevalence : 0.9856
## Balanced Accuracy : 0.5952
##
## 'Positive' Class : No
##
k-nearest neighbor function with \(k=10\)
knn.pred=knn(train.X,test.X,train.default,k=10)
confusionMatrix(knn.pred, default.test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3815 121
## Yes 5 10
##
## Accuracy : 0.9681
## 95% CI : (0.9621, 0.9734)
## No Information Rate : 0.9668
## P-Value [Acc > NIR] : 0.3489
##
## Kappa : 0.1311
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99869
## Specificity : 0.07634
## Pos Pred Value : 0.96926
## Neg Pred Value : 0.66667
## Prevalence : 0.96684
## Detection Rate : 0.96558
## Detection Prevalence : 0.99620
## Balanced Accuracy : 0.53751
##
## 'Positive' Class : No
##
Choosing the optimal value for k:
Here we try several values of K and find the one that minimizes the test error rate.
error<-rep(NA,20) # Placeholder
for(i in 1:20)
{
knn.pred=knn(train.X,test.X,train.default,k=i)
error[i]=mean(knn.pred!=default.test)
}
plot(error,type="b",xlab="K",ylab="Test Error")
(loc=which.min(error))
## [1] 5
Therefore, according to the above results, we select \(k=5\).
knn.pred=knn(train.X,test.X,train.default,k=5)
confusionMatrix(knn.pred, default.test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3804 107
## Yes 16 24
##
## Accuracy : 0.9689
## 95% CI : (0.963, 0.9741)
## No Information Rate : 0.9668
## P-Value [Acc > NIR] : 0.255
##
## Kappa : 0.2694
##
## Mcnemar's Test P-Value : 4.857e-16
##
## Sensitivity : 0.9958
## Specificity : 0.1832
## Pos Pred Value : 0.9726
## Neg Pred Value : 0.6000
## Prevalence : 0.9668
## Detection Rate : 0.9628
## Detection Prevalence : 0.9899
## Balanced Accuracy : 0.5895
##
## 'Positive' Class : No
##
Example: Iris Data set
For this example we will use Edgar Anderson’s Iris Data.
This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Let’s look at the data set:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
par(mfrow=c(2,2))
boxplot(Sepal.Length ~ Species, iris,
main = "Sepal Length wrt Species", col = "lightpink3")
boxplot(Sepal.Width ~ Species, iris,
main = "Sepal Width wrt Species", col = "antiquewhite1")
boxplot(Petal.Length ~ Species, iris,
main = "Petal Length wrt Species", col = "lightskyblue4")
boxplot(Petal.Width ~ Species, iris,
main = "Petal Width wrt Species", col = "orange1")
From the above boxplots, you can identify the patterns among the three species.
plot(iris[,1:4],col=iris$Species)
Now let’s try to fit a KNN model for the data set.
set.seed(299)
#Sample the Iris data set (70% train, 30% test)
ir_sample<-sample(1:nrow(iris),size=nrow(iris)*.7)
ir_train<-iris[ir_sample,] #Select the 70% of rows for training
ir_test<-iris[-ir_sample,] #Select the 30% of rows for testing
iris_pred<-knn(train = ir_train[,-5],test =ir_test[,-5],
cl = ir_train$Species,k=10) #k=10
#Confusion matrix
table(ir_test$Species, iris_pred)
## iris_pred
## setosa versicolor virginica
## setosa 17 0 0
## versicolor 0 13 2
## virginica 0 0 13
#Misclassification error
mean(iris_pred!=ir_test$Species)
## [1] 0.04444444
confusionMatrix(ir_test$Species, iris_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 17 0 0
## versicolor 0 13 2
## virginica 0 0 13
##
## Overall Statistics
##
## Accuracy : 0.9556
## 95% CI : (0.8485, 0.9946)
## No Information Rate : 0.3778
## P-Value [Acc > NIR] : 2.61e-16
##
## Kappa : 0.9331
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.8667
## Specificity 1.0000 0.9375 1.0000
## Pos Pred Value 1.0000 0.8667 1.0000
## Neg Pred Value 1.0000 1.0000 0.9375
## Prevalence 0.3778 0.2889 0.3333
## Detection Rate 0.3778 0.2889 0.2889
## Detection Prevalence 0.3778 0.3333 0.2889
## Balanced Accuracy 1.0000 0.9688 0.9333
Next, let’s try to determine the optimal K value.
iris_acc<-rep(NA,50) # using 50 values for K
for(i in 1:50){
#Apply knn with k = i
predict<-knn(ir_train[,-5],ir_test[,-5],
ir_train$Species,k=i)
iris_acc[i]<-mean(predict==ir_test$Species) ## this is prediction accuracy
}
Plot \(k=1\) through 50:
plot(1-iris_acc,type="l",ylab="Error Rate",
xlab="K",main="Error Rate for Iris With Varying K")
#minimum k value
which.min(1-iris_acc)
## [1] 3
According to the plot, there is no clearly best K value: the error rate does not behave smoothly across K values.
Therefore, let’s try to calculate the value for K using many samples from the Iris data set.
Here we will test 20 K values ranging from 1-20 but we will use 100 samples of the data set to evaluate each K value.
trial_sum<-rep(0,20)
trial_n<-rep(0,20)
set.seed(279)
for(i in 1:100){
ir_sample<-sample(1:nrow(iris),size=nrow(iris)*.7)
ir_train<-iris[ir_sample,]
ir_test<-iris[-ir_sample,]
test_size<-nrow(ir_test)
for(j in 1:20){
predict<-knn(ir_train[,-5],ir_test[,-5],
ir_train$Species,k=j)
trial_sum[j]<-trial_sum[j]+sum(predict==ir_test$Species)
trial_n[j]<-trial_n[j]+test_size
}
}
plot(1-trial_sum / trial_n,type="l",ylab="Error Rate",
xlab="K",main="Error Rate for Iris With Varying K (100 Samples)")
which.min(1-trial_sum / trial_n)
## [1] 9
Therefore, we can use \(k=9\) for the data set.
iris_pred<-knn(train = ir_train[,-5],test =ir_test[,-5],
cl = ir_train$Species,k=9)
confusionMatrix(ir_test$Species, iris_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 16 0
## virginica 0 2 14
##
## Overall Statistics
##
## Accuracy : 0.9556
## 95% CI : (0.8485, 0.9946)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : 2.842e-15
##
## Kappa : 0.933
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8889 1.0000
## Specificity 1.0000 1.0000 0.9355
## Pos Pred Value 1.0000 1.0000 0.8750
## Neg Pred Value 1.0000 0.9310 1.0000
## Prevalence 0.2889 0.4000 0.3111
## Detection Rate 0.2889 0.3556 0.3111
## Detection Prevalence 0.2889 0.3556 0.3556
## Balanced Accuracy 1.0000 0.9444 0.9677